Unboxing the blackbox: Convolutions

Today, we will try to implement our own convolution algorithm and convolution layer, to really see what's going on. I'll assume you already know what convolutions are and how CNNs work. To brush up on the ideas real quick, check out these 2 videos from Udacity's "Deep Learning Nanodegree program":

In [1]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('RnM1D-XI--8', width=800, height=450)
Out[1]:
In [2]:
YouTubeVideo('yfPEROi3SPU', width=800, height=450)
Out[2]:

State of the art

Let's see what the state-of-the-art convolution performance is:

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import torch.nn as nn
import torch.nn.functional as F
%load_ext memory_profiler
In [2]:
def openImg(url):
    return torch.Tensor(np.asarray(Image.open(url))).permute(2, 0, 1)

This is to demonstrate that the Image.open() function used above works:

In [3]:
rick = Image.open('rick_sanchez.jpg'); rick
Out[3]:

Here's the image as a tensor:

In [4]:
rick = openImg('rick_sanchez.jpg').cuda(); rick.shape
Out[4]:
torch.Size([3, 860, 1600])

As you can see, the dimensions we will be working with are (#channels, height, width). Let's also define some kernels:
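As a quick aside, here's the channel reordering that openImg() performs, sketched in plain numpy (the zero-filled array here is just a stand-in for the real image):

```python
import numpy as np

# A PIL image converts to a (height, width, #channels) array;
# permute(2, 0, 1) moves the channel axis first, matching openImg() above.
hwc = np.zeros((860, 1600, 3), dtype=np.float32)   # stand-in for np.asarray(Image.open(url))
chw = hwc.transpose(2, 0, 1)                       # numpy's equivalent of .permute(2, 0, 1)
print(chw.shape)                                   # (3, 860, 1600)
```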

In [5]:
identity = torch.Tensor([[0, 0, 0],
                         [0, 1, 0],
                         [0, 0, 0]]).cuda()
emboss = torch.Tensor([[-2, -1, 0],
                       [-1, 1, 1],
                       [0, 1, 2]]).cuda()
sharpen = torch.Tensor([[0, -1, 0],
                        [-1, 5, -1],
                        [0, -1, 0]]).cuda()
edgeDetect = torch.Tensor([[0, 1, 0],
                           [1, -4, 1],
                           [0, 1, 0]]).cuda()

From the documentation, F.conv2d() expects the image to have dimension (#samples, #input channels, image height, image width) and the kernel to have dimension (#output channels, #input channels, kernel height, kernel width). We're ignoring the groups parameter to make things simpler, as I've never really used it. Let's test it out:

In [7]:
rick1 = rick.unsqueeze(0).cuda(); print(f"Image dimension: {rick1.shape}")
kernel = torch.cuda.FloatTensor(3, 3, 3, 3).fill_(0)
kernel[0, 0] = emboss # in red to out red channel
kernel[1, 1] = emboss # in green to out green channel
kernel[2, 2] = emboss # in blue to out blue channel
%time transformed = F.conv2d(rick1, kernel)
print(f"Transformed image dimension: {transformed.shape}")
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed[0].permute(1, 2, 0).cpu()/256); pass
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Image dimension: torch.Size([1, 3, 860, 1600])
CPU times: user 438 µs, sys: 54 µs, total: 492 µs
Wall time: 423 µs
Transformed image dimension: torch.Size([1, 3, 858, 1598])

Seems to be about right. Also note the performance: it usually takes 400$\mu s$ - 1200$\mu s$. That's super impressive.
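By the way, the 858x1598 output size follows from the standard no-padding size formula. Here's a tiny helper sketching it (my own helper for illustration, not a PyTorch function):

```python
def conv2d_out_shape(n, c_in, h, w, c_out, k, stride=1, padding=0):
    """Output shape of a 2-D convolution with a square k x k kernel."""
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    return (n, c_out, out_h, out_w)

# Our 3x3 emboss kernel on the (1, 3, 860, 1600) image, no padding:
print(conv2d_out_shape(1, 3, 860, 1600, c_out=3, k=3))  # (1, 3, 858, 1598)
```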

Vanilla implementation

Let's strip away all the details and just consider a 3 channel image, with a 2 dimensional kernel. Let's try to do this using vanilla Python:

In [124]:
rick1 = rick.cpu(); print(f"Image dimension: {rick1.shape}")
kernel = emboss
# expects image to be (#channel, height, width), and kernel to have 2 dimensions
def conv(image, kernel):
    _, height, width = image.shape
    transformed = torch.zeros(3, height-2, width-2)
    for pixelY in range(0, height-2):
        print(f"\rProgress: {np.round(10000*pixelY/height)/100.0}%               ", end="")
        for pixelX in range(0, width-2):
            red, green, blue = 0, 0, 0
            for kernelX in range(3):
                for kernelY in range(3):
                    red += kernel[kernelY, kernelX] * image[0, pixelY+kernelY, pixelX+kernelX]
                    green += kernel[kernelY, kernelX] * image[1, pixelY+kernelY, pixelX+kernelX]
                    blue += kernel[kernelY, kernelX] * image[2, pixelY+kernelY, pixelX+kernelX]
            transformed[0, pixelY, pixelX] = red
            transformed[1, pixelY, pixelX] = green
            transformed[2, pixelY, pixelX] = blue
    print()
    return transformed
%time transformed = conv(rick1, kernel)
Image dimension: torch.Size([3, 860, 1600])
Progress: 99.65%               
CPU times: user 20min 30s, sys: 1.9 s, total: 20min 32s
Wall time: 20min 33s
In [125]:
print(f"Transformed image dimension: {transformed.shape}")
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed.permute(1, 2, 0)/256)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Transformed image dimension: torch.Size([3, 858, 1598])
Out[125]:
<matplotlib.image.AxesImage at 0x7fc323300090>

Okay, that's sort of unbearable. 20 minutes and 33 seconds, or about 1233s. That's nearly 3 million times slower. This is completely unacceptable. Why does this happen? There are a couple of reasons.

First of all, it is written in Python, a dynamic, interpreted language. This means variables don't carry explicit type declarations, and the programmer can just loosely tell Python what to do and it will do it. It also means the interpreter spends a whole bunch of time executing the code on the fly, and checking the type of each value before doing anything with it.

Second, Python has something called the "Global Interpreter Lock" that ensures only 1 thread can run at a given time. My computer has 8 cores, but vanilla Python can't really utilize more than 1. Compared to the convolution function provided by PyTorch, which can utilize hundreds of cores on a GPU, all running very optimized compiled C code (sometimes even a little assembly), vanilla Python is no match.

So it seems like we can't really do this, right? We can't really write C code in Python, and even if we could, it would be so stupidly hard to make it work that it wouldn't be worth the extra insight we'd gain.

Speeding things up: reduced dimensionality, 3x3 kernel

Actually, we can do this. I have managed to get within about 2x of the state-of-the-art speed using around 10 lines of Python. The general idea is to convert the vanilla Python solution into tensors and operations on them. Let's see how to do that.

Let's set things up first:

In [249]:
rick1 = rick[0]; print(f"Image dimension (red channel): {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(rick1.cpu()); rick1.mean() # to prove the colors are from 0-255
Image dimension (red channel): torch.Size([860, 1600])
Kernel dimension: torch.Size([3, 3])
Out[249]:
tensor(74.3805, device='cuda:0')

How are convolutions calculated? If we're just focusing on 1 channel at a time, then for each output pixel, we have to multiply the kernel with the pixels surrounding it and add them together. In the vanilla implementation, we loop through every pixel, then we loop through all the kernel items. The loop over every pixel seems like a good target for utilizing tensors, because there's far more looping going on there than over the kernel. So a possible solution is to multiply the whole image with one kernel item on the GPU, then repeat this for every kernel item with a simple loop. What will this look like?
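Before jumping to tensors, here's that idea verified in plain numpy on a toy single-channel image (the array and seed are made up for illustration): summing one scaled, shifted copy of the image per kernel item gives the same result as the direct window-by-window definition.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((6, 8))                  # toy single-channel image
kernel = np.array([[-2., -1., 0.],
                   [-1., 1., 1.],
                   [0., 1., 2.]])         # the emboss kernel from above

# Direct definition: each output pixel multiplies-and-sums its 3x3 window.
h, w = img.shape
direct = np.zeros((h - 2, w - 2))
for y in range(h - 2):
    for x in range(w - 2):
        direct[y, x] = (img[y:y+3, x:x+3] * kernel).sum()

# Per-kernel-item view: one scaled, shifted copy of the image per item.
shifted_sum = sum(kernel[ky, kx] * img[ky:h-2+ky, kx:w-2+kx]
                  for ky in range(3) for kx in range(3))
print(np.allclose(direct, shifted_sum))   # True
```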

Let's define 9 images from a11 to a33, each is the base image multiplied with a kernel item:

In [250]:
print(f"Kernel: \n{kernel}")
a11 = rick1 * kernel[0, 0]
a12 = rick1 * kernel[0, 1]
a13 = rick1 * kernel[0, 2]
a21 = rick1 * kernel[1, 0]
a22 = rick1 * kernel[1, 1]
a23 = rick1 * kernel[1, 2]
a31 = rick1 * kernel[2, 0]
a32 = rick1 * kernel[2, 1]
a33 = rick1 * kernel[2, 2]; print(f"a33's shape: {a33.shape}")
Kernel: 
tensor([[-2., -1.,  0.],
        [-1.,  1.,  1.],
        [ 0.,  1.,  2.]], device='cuda:0')
a33's shape: torch.Size([860, 1600])

What happens if we just add together all of these images? Should it work? Let's see:

In [251]:
a = a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
a.mean()
Out[251]:
tensor(74.3805, device='cuda:0')

The mean looks reasonable. Let's display it:

In [252]:
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(a.cpu())
Out[252]:
<matplotlib.image.AxesImage at 0x7f620c0106d0>

Doesn't look like the emboss effect is applied. Let's do it again, but with edge detect to make sure:

In [253]:
rick1 = rick[0]; kernel = edgeDetect
a = rick1 * kernel[0, 0] + rick1 * kernel[0, 1] + rick1 * kernel[0, 2] + rick1 * kernel[1, 0] + rick1 * kernel[1, 1] + rick1 * kernel[1, 2] + rick1 * kernel[2, 0] + rick1 * kernel[2, 1] + rick1 * kernel[2, 2]
plt.imshow(a.cpu())
a.mean()
Out[253]:
tensor(0., device='cuda:0')

Yep, something is off here. No edge detection whatsoever, and the mean pixel value is 0. What are we doing in the operation above? We're multiplying each copy of the image by a number in the kernel, then adding them together. That's the same as summing all the items in the kernel first, then multiplying the sum by the image. The emboss kernel sums to 1, and edge detect sums to 0, so the results make sense.
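We can check that claim in numpy: without shifting, the nine scaled copies collapse to kernel.sum() times the image (the random toy image here is made up for illustration).

```python
import numpy as np

img = np.random.default_rng(1).random((4, 5))     # toy single-channel image
emboss = np.array([[-2., -1., 0.], [-1., 1., 1.], [0., 1., 2.]])
edgeDetect = np.array([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])

for k in (emboss, edgeDetect):
    # Summing the nine unshifted, scaled copies of the image...
    scaled_sum = sum(k[i, j] * img for i in range(3) for j in range(3))
    # ...is the same as scaling the image by the sum of the kernel entries:
    print(np.allclose(scaled_sum, k.sum() * img), k.sum())
# emboss sums to 1 (image unchanged), edge detect sums to 0 (all zeros)
```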

So the multiplication alone isn't enough: we also have to translate each image a little bit, then add all of them together. We can do this using the roll() function:

In [254]:
a = torch.Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); display(a)
display(a.roll((1), (0)))
display(a.roll((1, 1), (0, 1)))
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])
tensor([[7., 8., 9.],
        [1., 2., 3.],
        [4., 5., 6.]])
tensor([[9., 7., 8.],
        [3., 1., 2.],
        [6., 4., 5.]])

You can see what it does to the tensor. The first argument is a tuple of integer shifts, and the second argument is a tuple of the dimensions to shift along.
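The same semantics exist in numpy as np.roll(), if you want to play with it outside PyTorch:

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]
# Shift by 1 along dimension 0: rows move down, the bottom row wraps to the top.
print(np.roll(a, 1, axis=0))
# Shift by 1 along both dimensions: wraps along rows and columns at once.
print(np.roll(a, (1, 1), axis=(0, 1)))
```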

In [255]:
rick1 = rick[0]; print(f"Image dimension (red channel): {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
a11 = (rick1 * kernel[0, 0]).roll((1, 1), (1, 0))
a12 = (rick1 * kernel[0, 1]).roll((0, 1), (1, 0))
a13 = (rick1 * kernel[0, 2]).roll((-1, 1), (1, 0))
a21 = (rick1 * kernel[1, 0]).roll((1, 0), (1, 0))
a22 = (rick1 * kernel[1, 1]).roll((0, 0), (1, 0))
a23 = (rick1 * kernel[1, 2]).roll((-1, 0), (1, 0))
a31 = (rick1 * kernel[2, 0]).roll((1, -1), (1, 0))
a32 = (rick1 * kernel[2, 1]).roll((0, -1), (1, 0))
a33 = (rick1 * kernel[2, 2]).roll((-1, -1), (1, 0))
a = a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(a.cpu())
a.mean()
Image dimension (red channel): torch.Size([860, 1600])
Kernel dimension: torch.Size([3, 3])
Out[255]:
tensor(74.3805, device='cuda:0')

Seems correct. How about the version where we have 3 channels?

In [256]:
rick1 = rick; print(f"Image dimension: {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
a11 = (rick1 * kernel[0, 0]).roll((1, 1), (2, 1))
a12 = (rick1 * kernel[0, 1]).roll((0, 1), (2, 1))
a13 = (rick1 * kernel[0, 2]).roll((-1, 1), (2, 1))
a21 = (rick1 * kernel[1, 0]).roll((1, 0), (2, 1))
a22 = (rick1 * kernel[1, 1]).roll((0, 0), (2, 1))
a23 = (rick1 * kernel[1, 2]).roll((-1, 0), (2, 1))
a31 = (rick1 * kernel[2, 0]).roll((1, -1), (2, 1))
a32 = (rick1 * kernel[2, 1]).roll((0, -1), (2, 1))
a33 = (rick1 * kernel[2, 2]).roll((-1, -1), (2, 1))
a = a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(a.permute(1, 2, 0).cpu()/256); pass
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Image dimension: torch.Size([3, 860, 1600])
Kernel dimension: torch.Size([3, 3])

Nice. Let's bunch it in a function and measure its performance.

In [33]:
rick1 = rick; print(f"Image dimension: {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
def conv2(img, kernel):
    a11 = (img * kernel[0, 0]).roll((1, 1), (2, 1))
    a12 = (img * kernel[0, 1]).roll((0, 1), (2, 1))
    a13 = (img * kernel[0, 2]).roll((-1, 1), (2, 1))
    a21 = (img * kernel[1, 0]).roll((1, 0), (2, 1))
    a22 = (img * kernel[1, 1]).roll((0, 0), (2, 1))
    a23 = (img * kernel[1, 2]).roll((-1, 0), (2, 1))
    a31 = (img * kernel[2, 0]).roll((1, -1), (2, 1))
    a32 = (img * kernel[2, 1]).roll((0, -1), (2, 1))
    a33 = (img * kernel[2, 2]).roll((-1, -1), (2, 1))
    return a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
%time transformed = conv2(rick1, kernel)
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed.permute(1, 2, 0).cpu()/256); pass
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Image dimension: torch.Size([3, 860, 1600])
Kernel dimension: torch.Size([3, 3])
CPU times: user 1.13 ms, sys: 111 µs, total: 1.24 ms
Wall time: 838 µs

Holy shit, did you see that? Now it only takes 838$\mu s$. This is very close to the state of the art.

Speeding things up: complete dimensionality, 3x3 kernel

Now let's try to bring this over to the definitions for F.conv2d(). The function below takes in a (3, 3) kernel and spits out a (3, 3, 3, 3) kernel, like the one from the example way back at the beginning:

In [10]:
def advancedKernel(inKernel):
    kernel = torch.cuda.FloatTensor(3, 3, 3, 3).fill_(0)
    kernel[0, 0] = inKernel # in red to out red channel
    kernel[1, 1] = inKernel # in green to out green channel
    kernel[2, 2] = inKernel # in blue to out blue channel
    return kernel
emboss1 = advancedKernel(emboss)
print(f"Emboss shape: {emboss1.shape}")
emboss1
Emboss shape: torch.Size([3, 3, 3, 3])
Out[10]:
tensor([[[[-2., -1.,  0.],
          [-1.,  1.,  1.],
          [ 0.,  1.,  2.]],

         [[ 0.,  0.,  0.],
          [ 0.,  0.,  0.],
          [ 0.,  0.,  0.]],

         [[ 0.,  0.,  0.],
          [ 0.,  0.,  0.],
          [ 0.,  0.,  0.]]],


        [[[ 0.,  0.,  0.],
          [ 0.,  0.,  0.],
          [ 0.,  0.,  0.]],

         [[-2., -1.,  0.],
          [-1.,  1.,  1.],
          [ 0.,  1.,  2.]],

         [[ 0.,  0.,  0.],
          [ 0.,  0.,  0.],
          [ 0.,  0.,  0.]]],


        [[[ 0.,  0.,  0.],
          [ 0.,  0.,  0.],
          [ 0.,  0.,  0.]],

         [[ 0.,  0.,  0.],
          [ 0.,  0.,  0.],
          [ 0.,  0.,  0.]],

         [[-2., -1.,  0.],
          [-1.,  1.,  1.],
          [ 0.,  1.,  2.]]]], device='cuda:0')

Now to the image:

In [260]:
rick1 = rick.unsqueeze(0); print(f"Image dimension: {rick1.shape}")
Image dimension: torch.Size([1, 3, 860, 1600])

Now on to the function. It looks similar to the function conv2() we defined above, but there are more moving parts:

In [31]:
def conv3(img, kernel):
    img = img.permute(0, 2, 3, 1)
    kernel = kernel.permute(2, 3, 1, 0)
    a11 = (img @ kernel[0, 0]).roll((1, 1), (2, 1))
    a12 = (img @ kernel[0, 1]).roll((0, 1), (2, 1))
    a13 = (img @ kernel[0, 2]).roll((-1, 1), (2, 1))
    a21 = (img @ kernel[1, 0]).roll((1, 0), (2, 1))
    a22 = (img @ kernel[1, 1]).roll((0, 0), (2, 1))
    a23 = (img @ kernel[1, 2]).roll((-1, 0), (2, 1))
    a31 = (img @ kernel[2, 0]).roll((1, -1), (2, 1))
    a32 = (img @ kernel[2, 1]).roll((0, -1), (2, 1))
    a33 = (img @ kernel[2, 2]).roll((-1, -1), (2, 1))
    return (a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33).permute(0, 3, 1, 2)
kernel1 = advancedKernel(edgeDetect)
%time transformed = conv3(rick1, kernel1)
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow((transformed[0].permute(1, 2, 0)/256).cpu())
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
CPU times: user 792 µs, sys: 82 µs, total: 874 µs
Wall time: 782 µs
Out[31]:
<matplotlib.image.AxesImage at 0x7f9ecefbb910>

Seems like it's working fine. Let's unpack what's going on here. The image's shape is:

(#samples, #channels, iHeight, iWidth), where the prefix "i" stands for "image". The kernel's shape is:

(#out channels, #in channels, kHeight, kWidth), where the prefix "k" stands for "kernel".

Let's think about the core operation we did in function conv2(): img * kernel[0, 0]

The part that picks out specific kernel items actually picks from the last 2 dimensions, kHeight and kWidth. Thus, in conv3(), kernel[0, 0] should be a tensor with shape (#in channels, #out channels). This looks a lot like a standard matrix operation: normally, an (a, b) weight tensor "translates" samples with a features into samples with b features. Here, the input image has a channels and the output image has b channels, so each kernel item really is a simple linear transformation between channels. But the dimension order is reversed, so we have to move the channel dimensions into position using permute. We also want to be able to pick out a single kernel item with 2 leading indexes, as before. Therefore, the correct permutation is kernel.permute(2, 3, 1, 0).

Same thing for the image: now that the input and output channels are distinct, we have to build that into the image tensor itself. It's convenient to make the channel dimension the last one, so we can use a normal matrix multiplication between the image and the kernel item. So the correct image permutation is img = img.permute(0, 2, 3, 1). After that, we can just roll as before and everything works out.
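To make the channel-contraction part concrete, here's a numpy sketch of what img @ kernel[ky, kx] does once the channel dimension is last (the shapes and seed here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.random((1, 4, 5, 3))   # (#samples, height, width, #in channels)
w = rng.random((3, 2))           # one kernel item: (#in channels, #out channels)

# The matmul contracts the trailing channel dimension, i.e. it applies the
# same (3 -> 2)-channel linear map independently at every pixel:
out = img @ w
print(out.shape)                 # (1, 4, 5, 2)
print(np.allclose(out[0, 0, 0], img[0, 0, 0] @ w))  # True for every pixel
```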

And the performance is not too bad: 782$\mu s$. Surprisingly, this is pretty much the same performance as our conv2() function above. I still don't understand why this is the case, as the image and kernel are larger and more complex, and there's a matrix multiplication instead of a simple elementwise multiplication. All in all, I think this is a success. Now let's try to make the convolution better.

Speeding things up: complete dimensionality, custom kernel dimension

In [15]:
def conv4(imgs, kernel):
    imgs = imgs.permute(0, 2, 3, 1)
    kernel = kernel.permute(2, 3, 1, 0)
    ks, inChannels, outChannels = kernel.shape[0], kernel.shape[2], kernel.shape[3]
    padding = ks - 1
    samples, height, width, _ = imgs.shape
    transformed = torch.cuda.FloatTensor(samples, height-padding, width-padding, outChannels).zero_()
    for yKernel in range(ks):
        for xKernel in range(ks):
            transformed += (imgs @ kernel[yKernel, xKernel])[:, yKernel:height-padding+yKernel, xKernel:width-padding+xKernel, :]
    return transformed.permute(0, 3, 1, 2)

Here, the idea is sort of similar, but with a little more complexity to accommodate a changing kernel size. Also, we're not using the roll() function anymore; we just sample smaller, shifted images directly from the tensor.
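Here's the slicing idea in isolation, sketched in numpy for a single channel and an arbitrary square kernel (a simplified version of conv4 without the channel matmul; the arrays are made up for illustration):

```python
import numpy as np

def conv_slice(img, kernel):
    # One shifted, scaled slice of the image per kernel item, like conv4.
    k = kernel.shape[0]
    h, w = img.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for ky in range(k):
        for kx in range(k):
            out += kernel[ky, kx] * img[ky:h - k + 1 + ky, kx:w - k + 1 + kx]
    return out

img = np.random.default_rng(3).random((6, 7))
kernel5 = np.random.default_rng(4).random((5, 5))   # any square size works now
print(conv_slice(img, kernel5).shape)               # (2, 3)
```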

In [25]:
kernel1 = advancedKernel(emboss)
rick1 = rick.unsqueeze(0)
%time transformed = conv4(rick1, kernel1)
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed[0].permute(1, 2, 0).cpu()/256); pass
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
CPU times: user 675 µs, sys: 89 µs, total: 764 µs
Wall time: 769 µs

The performance here is even better, at 769 $\mu s$, and I still don't understand why that is.

Implementing nn.Conv2d

The idea for this is simple enough. We already have the convolution function, ready to have kernels passed in. So we just need to create kernel parameters with the correct size. Let's set everything up first, so we know where we're going with this:

In [16]:
import torchvision.datasets as datasets
import torchvision
import torch.optim as optim
import time

Here are our data loaders. I'm too lazy to create my own custom dataset class, so I'm just gonna use PyTorch's default datasets:

In [20]:
dl = torch.utils.data.DataLoader(datasets.MNIST("datasets/", download=True,
                                 transform=torchvision.transforms.Compose([
                                     torchvision.transforms.ToTensor(),
                                     torchvision.transforms.Normalize((0.1307,), (0.3081,))
                                 ])), batch_size=1000, shuffle=True)
dl_test = torch.utils.data.DataLoader(datasets.MNIST("datasets/", download=True, train=False,
                                 transform=torchvision.transforms.Compose([
                                     torchvision.transforms.ToTensor(),
                                     torchvision.transforms.Normalize((0.1307,), (0.3081,))
                                 ])), batch_size=1000, shuffle=True)

Then here's our network. Pretty straightforward; all components are standard for a CNN. Granted, you don't need to build a CNN for MNIST because it's a bit overkill, but we're still gonna do it to test out our convolution module. Also note we can quite easily slip in a different convolution module:

In [17]:
class Net(nn.Module):
    def __init__(self, convModule):
        super().__init__()
        self.conv1 = convModule(1, 3, 3)
        self.conv2 = convModule(3, 3, 3)
        self.conv3 = convModule(3, 6, 3)
        self.conv4 = convModule(6, 6, 3)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.logSoftmax = nn.LogSoftmax(1)
        self.fc1 = nn.Linear(96, 30)
        self.fc2 = nn.Linear(30, 10)
        self.dropout = nn.Dropout(0.2)
    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.pool(x)
        x = self.relu(self.conv3(x))
        x = self.relu(self.conv4(x))
        x = self.pool(x)
        x = self.dropout(x.contiguous().view(-1, 6 * 4 * 4))
        x = self.dropout(self.relu(self.fc1(x)))
        return self.logSoftmax(self.fc2(x))
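The magic numbers 96 and 6 * 4 * 4 in the network come from tracking the spatial size of a 28x28 MNIST digit through the layers (each unpadded 3x3 conv shrinks the image by 2, and each pool halves it):

```python
size = 28      # MNIST input
size -= 2      # conv1 -> 26
size -= 2      # conv2 -> 24
size //= 2     # pool  -> 12
size -= 2      # conv3 -> 10
size -= 2      # conv4 -> 8
size //= 2     # pool  -> 4
print(6 * size * size)   # 96, matching fc1 = nn.Linear(96, 30)
```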

Interesting fact: it seems like nn.Conv2d outputs a tensor that is already laid out nicely in memory (aka "contiguous", though that word is sort of a misnomer). Either their internal workings are really clean, or they just call contiguous() at the end. Anyway, the custom one we've built doesn't have that nice memory layout, so we have to call contiguous() ourselves.

Below is the convolution module. As you can see, we initialize a kernel, and a bias for each output channel. The initialization uses a very specific scheme, but the actual calculation is basically just conv4() above:

In [18]:
class Conv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        kSqrt = 1/np.sqrt(in_channels*kernel_size**2)
        self.kernel = nn.Parameter(torch.rand(out_channels, in_channels, kernel_size, kernel_size)*2*kSqrt-kSqrt)
        self.bias = nn.Parameter(torch.rand(out_channels).unsqueeze(-1).unsqueeze(-1)*2*kSqrt-kSqrt)
    def forward(self, imgs):
        return conv4(imgs, self.kernel) + self.bias
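As far as I can tell, the uniform range used above matches the bound that nn.Conv2d's default initialization ends up with, 1/sqrt(fan_in). A quick numpy sketch of the same range (shapes made up for illustration):

```python
import numpy as np

in_channels, out_channels, k = 3, 4, 3
b = 1 / np.sqrt(in_channels * k**2)     # fan_in = 27, so b is about 0.192
# np.random.rand (like torch.rand) is uniform in [0, 1), so *2*b - b maps it to [-b, b):
kernel = np.random.rand(out_channels, in_channels, k, k) * 2 * b - b
print(kernel.min() >= -b and kernel.max() < b)   # True
```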

And here's our training function. Again, pretty straightforward, and it returns graphable data so we can visualize the performance better:

In [21]:
def train(net, dl=dl, dl_test=dl_test):
    lossFunction= nn.NLLLoss()
    optimizer = optim.Adam(net.parameters(), lr=0.001)
    losses = []
    accuracies = []
    times = []
    count = 0
    initialTime = time.time()
    for epoch in range(5):
        for imgs, labels in dl:
            count += 1
            optimizer.zero_grad()
            imgs, labels = imgs.cuda(), labels.cuda()
            output = net(imgs)
            loss = lossFunction(output, labels)
            if count % 10 == 0:
                test_imgs, test_labels = next(iter(dl_test))
                accuracies.append((torch.argmax(net(test_imgs.cuda()), dim=1) == test_labels.cuda()).sum())
                times.append(int(time.time()-initialTime))
                losses.append(loss.item())
                print(f"\rLoss: {losses[-1]}, accuracy: {accuracies[-1]}/1000    ", end="")
            loss.backward()
            optimizer.step()
    return times, losses, accuracies

And here's actually training it:

In [156]:
times1, losses1, acc1 = train(Net(nn.Conv2d).cuda())
times2, losses2, acc2 = train(Net(Conv2d).cuda())
Loss: 0.3227958381175995, accuracy: 908/1000     

Let's see how they fare against each other:

In [173]:
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, np.array(acc1)/10, "r")
plt.plot(times2, np.array(acc2)/10, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Time (s)")
plt.ylabel("% correct")
plt.grid(True)
plt.show()
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, losses1, "r")
plt.plot(times2, losses2, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Time (s)")
plt.ylabel("Loss")
plt.grid(True); pass

What if they executed at the same speed? Would there be any accuracy degradation?

In [174]:
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, np.array(acc1)/10, "r")
plt.plot(times1, np.array(acc2)/10, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Scaled time (s)")
plt.ylabel("% correct")
plt.grid(True)
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, losses1, "r")
plt.plot(times1, losses2, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Scaled time (s)")
plt.ylabel("Loss")
plt.grid(True); pass

They're pretty close actually, but the one we've built from scratch is a little bit more accurate. I have not tested this many times, so it may just be random chance.

So on average, in independent testing and in real training, our convolution layer built from scratch is around 2 times slower than the built-in layer. Not too shabby, considering people have spent a lot of time thinking about this and optimizing the shit out of their systems, and convolutions are not the simplest of layers.

We still haven't dealt with custom strides, or the transposed convolution thingy. Custom strides seem easy, but currently I can't think of a way to do them fast, and I'm still not sure how to recreate the transposed convolution behavior, so those are topics for another day.

Binary compatibility

Let's see if plugging the same weights into the built-in implementation gives the same output or not. First let's define a function to detach all parameters from the gradient graph, so no history gets accumulated:

In [13]:
def stripRequiresGrad(net):
    for param in net.parameters():
        param.requires_grad_(False)
    return net

Also note that we possibly don't need to do this, but the complex inner workings of PyTorch make me not want to risk anything. Now, let's define 2 dead-simple networks and strip away requires_grad:

In [22]:
net1 = stripRequiresGrad(nn.Conv2d(3, 4, 3).cuda())
net2 = stripRequiresGrad(Conv2d(3, 4).cuda())
print(f"Network 1's kernel size: {net1.weight.shape}")
print(f"Network 1's bias size: {net1.bias.shape}")
print(f"Network 2's kernel size: {net2.kernel.shape}")
print(f"Network 2's bias size: {net2.bias.shape}")
Network 1's kernel size: torch.Size([4, 3, 3, 3])
Network 1's bias size: torch.Size([4])
Network 2's kernel size: torch.Size([4, 3, 3, 3])
Network 2's bias size: torch.Size([4, 1, 1])

Pretty close to each other. The bias shapes are a little different, but that's easy to translate. Let's see the outputs of the networks:

In [23]:
img = torch.rand(1, 3, 4, 5).cuda()
print(f"Output of network 1: \n{net1(img)}\nShape: {net1(img).shape}\n")
print(f"Output of network 2: \n{net2(img)}\nShape: {net2(img).shape}")
Output of network 1: 
tensor([[[[ 0.2785,  0.2147,  0.4236],
          [ 0.3591,  0.6908,  0.0266]],

         [[-0.4474, -0.1876, -0.1753],
          [-0.3498, -0.5101, -0.2187]],

         [[ 0.6807,  0.7967,  0.8076],
          [ 0.6535,  0.7870,  0.5375]],

         [[ 0.0079, -0.3984,  0.1132],
          [-0.5219,  0.0539,  0.1182]]]], device='cuda:0')
Shape: torch.Size([1, 4, 2, 3])

Output of network 2: 
tensor([[[[ 0.2178, -0.1353,  0.0101],
          [-0.2325,  0.0305,  0.0577]],

         [[-0.0851, -0.0394, -0.1525],
          [-0.0978, -0.0924, -0.0584]],

         [[ 0.0074, -0.1416, -0.0730],
          [ 0.2378, -0.1673, -0.0738]],

         [[-0.1034, -0.2069, -0.1866],
          [-0.0293,  0.1014,  0.2213]]]], device='cuda:0')
Shape: torch.Size([1, 4, 2, 3])

Let's make the parameters the same, and see if their outputs are the same:

In [24]:
net1.weight.data = net2.kernel.data
net1.bias.data = net2.bias.data.squeeze()
print(f"Output of network 1: \n{net1(img)}\nShape: {net1(img).shape}\n")
print(f"Output of network 2: \n{net2(img)}\nShape: {net2(img).shape}")
Output of network 1: 
tensor([[[[ 0.2178, -0.1353,  0.0101],
          [-0.2325,  0.0305,  0.0577]],

         [[-0.0851, -0.0394, -0.1525],
          [-0.0978, -0.0924, -0.0584]],

         [[ 0.0074, -0.1416, -0.0730],
          [ 0.2378, -0.1673, -0.0738]],

         [[-0.1034, -0.2069, -0.1866],
          [-0.0293,  0.1014,  0.2213]]]], device='cuda:0')
Shape: torch.Size([1, 4, 2, 3])

Output of network 2: 
tensor([[[[ 0.2178, -0.1353,  0.0101],
          [-0.2325,  0.0305,  0.0577]],

         [[-0.0851, -0.0394, -0.1525],
          [-0.0978, -0.0924, -0.0584]],

         [[ 0.0074, -0.1416, -0.0730],
          [ 0.2378, -0.1673, -0.0738]],

         [[-0.1034, -0.2069, -0.1866],
          [-0.0293,  0.1014,  0.2213]]]], device='cuda:0')
Shape: torch.Size([1, 4, 2, 3])

Seems like they're the same. Let's calculate the difference between them just to make sure:

In [25]:
torch.abs(net1(img) - net2(img)).sum()
Out[25]:
tensor(4.4331e-07, device='cuda:0')

Yep, they're the same, up to floating-point error.